For this exercise, we will be using pandas, NumPy, and matplotlib:
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline
In the /data/ folder, you will find a series of .json files called dataN.json, numbered 1-4. Each file contains the following data, but each stores it in a different JSON orientation:
|   | birthday | first_name | last_name |
|---|----------|------------|-----------|
| 0 | 5/3/67   | Robert     | Hernandez |
| 1 | 8/4/84   | Steve      | Smith     |
| 2 | 9/13/91  | Anne       | Raps      |
| 3 | 4/15/75  | Alice      | Muller    |
In [47]:
#Your code here...
#Each file stores the same records in a different JSON orientation
file1 = pd.read_json('../../data/data1.json')
file2 = pd.read_json('../../data/data2.json')
file3 = pd.read_json('../../data/data3.json', orient='columns')
file4 = pd.read_json('../../data/data4.json', orient='split')

#data2.json reads back transposed, so flip it before concatenating
combined = pd.concat([file1, file2.T, file3, file4], ignore_index=True)
combined
Out[47]:
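Each orient tells read_json how the JSON maps onto rows and columns. As a self-contained illustration (synthetic one-row JSON strings, not the actual data files), the cell below sketches the layouts used above:
In [ ]:
import io

#orient='columns' (the default): top-level keys are column names
columns_style = '{"birthday":{"0":"5/3/67"},"first_name":{"0":"Robert"}}'
print(pd.read_json(io.StringIO(columns_style)))

#orient='split': separate keys for columns, index, and data
split_style = '{"columns":["birthday","first_name"],"index":[0],"data":[["5/3/67","Robert"]]}'
print(pd.read_json(io.StringIO(split_style), orient='split'))

#A file written with orient='index' reads back transposed under the
#default orient, which is why file2 needs .T above
index_style = '{"0":{"birthday":"5/3/67","first_name":"Robert"}}'
print(pd.read_json(io.StringIO(index_style)).T)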
In the data folder, there is a webserver log file called hackers-access.httpd. For this exercise, you will use this file to determine the most common operating systems and browsers used to access the server.
In order to accomplish this task, parse each log entry with the apache_log_parser module, then extract the operating system and browser from each user-agent string with the user_agents module, the documentation for which is available here: https://pypi.python.org/pypi/user-agents
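Before applying it to the whole log, it helps to see what user_agents returns for a single string. A quick sketch (a sample user-agent, not one taken from the log; the exact family names depend on the installed ua-parser data):
In [ ]:
from user_agents import parse

ua = parse('Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 '
           '(KHTML, like Gecko) Chrome/41.0.2272.76 Safari/537.36')
print(str(ua))            #e.g. "PC / Windows 7 / Chrome 41.0.2272"
print(ua.os.family)       #structured access to the operating system
print(ua.browser.family)  #structured access to the browser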
In [48]:
import apache_log_parser
from user_agents import parse

#str(UserAgent) looks like "PC / Windows 7 / Chrome 41.0.2272":
#the second field is the operating system, the third is the browser
def parse_ua(line):
    parsed_data = parse(line)
    return str(parsed_data).split('/')[1].strip()

def parse_ua_2(line):
    parsed_data = parse(line)
    return str(parsed_data).split('/')[2].strip()

#Read in the log file
line_parser = apache_log_parser.make_parser("%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"")

parsed_server_data = []
with open("../../data/hackers-access.httpd", "r") as server_log:
    for line in server_log:
        parsed_server_data.append(line_parser(line))

server_df = pd.DataFrame(parsed_server_data)

#Apply the functions to the dataframe
server_df['OS'] = server_df['request_header_user_agent'].apply(parse_ua)
server_df['Browser'] = server_df['request_header_user_agent'].apply(parse_ua_2)

#Get the top 10 values
server_df['OS'].value_counts().head(10)
Out[48]:
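The OS counts above answer half the question; the same value_counts() pattern gives the top 10 browsers:
In [ ]:
#Get the top 10 browsers
server_df['Browser'].value_counts().head(10)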
Using the dailybots.csv file, read the data into a DataFrame and perform the following operations:
You will need the groupby() function, which is documented here: http://pandas.pydata.org/pandas-docs/stable/groupby.html
In [60]:
#Your code here...
bots = pd.read_csv('../../data/dailybots.csv')

#Filter to the Government/Politics industry, keeping only the
#columns we need
gov_bots = bots.loc[bots['industry'] == 'Government/Politics',
                    ['botfam', 'hosts']]

#Count the number of records for each bot family
gov_bots.groupby('botfam').size()
Out[60]:
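Note that size() counts rows per group rather than summing hosts. If the goal is the total number of hosts seen per bot family, a sum over the hosts column is the natural aggregation (a sketch over the same gov_bots frame):
In [ ]:
#Total hosts per bot family, largest first
gov_bots.groupby('botfam')['hosts'].sum().sort_values(ascending=False)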